TITLE: HEAD RELATIONAL TRANSFER FUNCTION VIRTUALIZER

FIELD OF THE INVENTION

The invention relates to spatial audio systems and in particular relates to systems and 
methods of producing, adjusting and maintaining natural sounds, e.g., speaking voices, in 
a telecommunication environment. 

BACKGROUND

Computer Telephone Integrated (CTI) audio terminals typically have multiple speakers or 
a stereo headset. The existence of multiple audio sources, and the flexibility in placing 
them, particularly in the case of computer audio speakers, creates the means to recreate a 
proper perspective for the brain to resolve the body's relationship to an artificial or 
remote speaking partner. Telephone handsets and hands-free audio conferencing
terminals do not take into account the relative position between the one or more speaking
persons and their audience. Present devices simulate a single point source of an audio
signal that emanates typically from a fixed position, whether it is sensed via compression
diaphragm of the handset or the speaker of a teleconferencing system.



The relationship between this point source and the rest of the listener's body, specifically
his or her head, ears, shoulders, and chest, is drastically different from what the
relationship would be if the two participants were to speak face to face. The inaccurate
portrayal of this relationship creates a psychoacoustical phenomenon termed "listener's
fatigue," produced when the brain cannot reconcile the auditory signal with a proper audio
source. Over time this incongruity results in varying degrees of psychosomatic
discomfort.

FIG. 1 illustrates a system 100 where a listener 126 exchanges audio signals with a 
remote human speaker 102. While both listener 126 and human speaker 102 may have
similar interposed signal processing devices, only those elements necessary for
illustrating the prior art are illustrated. The user or listener 126 perceives his or her
counterpart, the human speaker 102 or source, as a flat sound wall 128 emanating from a
left audio speaker 122 and a right audio speaker 124, for example. The flat sound wall
128 is not a realistic representation of an actual human audio source. In this example, a
human speaker 102 is within pickup range of a microphone 104. The microphone 104
connects to a computer 106 wherein the audio signals are converted into a format 
compatible with being transmitted to the listener. For transmission via a Public Switched 
Telephone Network (PSTN) or other circuit switched system, the microphone interface 
108 may perform analog anti-aliasing filtering before sending the analog signal to a
coder-decoder for sampling, quantizing, and compressing the digital stream to be
expanded and converted to analog signals on the receiving end. Alternatively, the 
digitized audio signals, particularly compressed and encoded voice signals, may be 
transmitted as data packets over a network such as the Internet. The Voice-over Internet 
Protocol (VoIP) is an example of such an Internet protocol that may use a Session
Initiation Protocol (SIP) to define the VoIP switching fabric. From a communication
processing interface like a VoIP interface 110, the voice data packets leave the human
speaker's computer 106 and travel via the Internet 110 to the listener's computer 112.
The listener's communication processing interface, like the VoIP interface 114 of the
listener's computer, reconstructs the media stream into the monaural signal 117 similar to
the signal recorded at the speaker's microphone 104. The destination processing 112
applies forms of spatial audio filtering 116 to shape the monaural signal 117 to then be
sent to two or more audio speaker drivers 118. With equalization filtering alone, the pair 
of audio speakers 122,124 are perceived by the listener as being a flat source 128 that is 
equidistant between the two audio speakers 122, 124. Techniques are available for 
processing monaural signals to laterally translate the perceived source location 132 to the
left or right of center by varying a transport delay between the two channels of a set of
headphones, e.g., binaural processing. The left audio speaker 122 and right audio speaker
124 of the example illustrated in FIG. 1 may be spaced, for computer-based telephony 
interface layouts, at +5 degrees and -5 degrees respectively from an axis having an
origin at the listener and extending to and perpendicular with the audio speaker array. For
teleconferencing environments, that spacing may be increased to +30 and -30 degrees.
This audio speaker spacing produces crosstalk at the left and right ears of the listener.
With transaural processing applied to cancel or substantially reduce crosstalk between
audio speaker channels, the perceived audio effect can be enhanced. The perceived
effect of audio source translation is adjustable by the listener.

Psychoacoustic characteristics of the sound may be exploited in whole or part to create a
perceived change in distance. Psychoacoustic characteristics of the sound of a source
increasing in distance from the listener include: quieter sound due to the extra distance traveled;
less high frequency content, principally due to air absorption; more reverberation,
particularly in a reflective environment; less difference between the time of the direct sound and
the first floor reflection, creating a straight wave front; and an attenuated ground reflection. An
additional spatial filter effect that follows is to lower the intensity, or volume, attenuate
the higher frequencies, and add some forms of reverberation, for example, whereby the
listener perceives the audio source increasing in distance from the listener. Again, this
perceived effect is adjustable by the listener. Thus, the perceived audio source can be
translated to the left, for example 132, translated in added distance 130, or given a combination
of left translation and added distance 134.
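The lateral translation and distance cues described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the filtering of any particular prior art system: the function names, the one-pole low-pass filter, the single echo standing in for reverberation, and all numeric defaults are assumptions chosen for brevity.

```python
import numpy as np

def one_pole_lowpass(x, alpha):
    """y[n] = alpha*x[n] + (1-alpha)*y[n-1]; a smaller alpha removes more highs."""
    y = np.empty(len(x), dtype=float)
    prev = 0.0
    for i, sample in enumerate(x):
        prev = alpha * sample + (1.0 - alpha) * prev
        y[i] = prev
    return y

def add_distance(x, gain=0.5, alpha=0.3, echo_delay=400, echo_gain=0.25):
    """Return a quieter, duller, slightly reverberant copy of x.

    All parameter values are hypothetical tuning knobs for the listener.
    """
    y = gain * one_pole_lowpass(np.asarray(x, dtype=float), alpha)
    out = np.concatenate([y, np.zeros(echo_delay)])
    out[echo_delay:] += echo_gain * y  # one echo as a crude reverberation cue
    return out

def translate_lateral(x, delay_samples):
    """Return (left, right) channels; delaying the right channel makes the
    left channel lead, which pulls the perceived image toward the left."""
    x = np.asarray(x, dtype=float)
    pad = np.zeros(delay_samples)
    left = np.concatenate([x, pad])
    right = np.concatenate([pad, x])
    return left, right
```

Applying `add_distance` and then `translate_lateral` to the same monaural signal corresponds to the combined case of left translation and added distance.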



For each ear of the listener, the Head-Related Impulse Response (HRIR) characterizes the 
impulse response, h(t), from the audio source to the ear drum, that is, the normalized
sound pressure that an arbitrary source, x(t), produces at the listener's ear drum. The 
Fourier transform of h(t) is called the Head-Related Transfer Function (HRTF). The 
HRTF captures all of the physical cues to source localization. For a known HRTF for the 
left ear and the right ear, headphones aid in synthesizing accurate binaural signals from a
monaural source. In the application of classical time and frequency domain analysis, the
HRTF can be described as a function of four variables, i.e., three space coordinates and 
frequency. In spherical coordinates where distances are greater than about one meter, the 
source is said to be in the far or free field, and the HRTF falls off inversely with range. 
Accordingly, most HRTF measurements are free field measurements. Such a free field
HRTF database of filter coefficients essentially reduces the HRTF to a function of
azimuth, elevation and frequency. For a readily implementable system, the HRTF matrix
of filter coefficients is further reduced to a function of azimuth and frequency. 

For audio frequency, $\omega$, an angle in azimuth, $\phi$, in the horizontal plane, and an angle in
the vertical plane, $\delta$, the Fourier transform of the sound pressure measured in the
listener's left ear can be written as $P_{\mathrm{probe,left}}(j\omega, \phi, \delta)$ and the Fourier transform for the
free field, independent of sound incidence, can be written as $P_{\mathrm{reference}}(j\omega, \phi, \delta)$, where $j$
represents the imaginary number, $\sqrt{-1}$. Accordingly, the free-field (ff) head-related
transfer function for the listener's left ear can be written as

$$H_{\mathrm{ff,left}}(j\omega, \phi, \delta) = \frac{P_{\mathrm{probe,left}}(j\omega, \phi, \delta)}{P_{\mathrm{reference}}(j\omega, \phi, \delta)}.$$

The HRTF then accounts for the sound diffraction caused by the listener's head and torso
and, given the manner in which measurement data are taken, outer ear effects as well. For
example, the left and right HRTF for a particular azimuth and elevation angle of
incidence can evidence a 20 dB difference due to interaural effects as well as a 600
microsecond delay (where the speed of sound, c, is approximately 340 meters/second).
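As a rough check on the scale of these figures, the interaural delay can be converted into an acoustic path difference between the two ears, and the level difference into an amplitude ratio. The symbols $\Delta t$ and $\Delta d$ are introduced here only for this illustration.

$$\Delta d = c\,\Delta t \approx 340\ \mathrm{m/s} \times 600 \times 10^{-6}\ \mathrm{s} \approx 0.20\ \mathrm{m}$$

$$20\ \mathrm{dB} \;\Rightarrow\; \frac{|P_{\mathrm{near}}|}{|P_{\mathrm{far}}|} = 10^{20/20} = 10$$

A path difference of about 0.2 meters is on the order of the distance around a human head, and a 20 dB difference corresponds to a tenfold difference in sound pressure amplitude between the ears.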

In the case of a listener with headphones, the typically binaural spatial filtering may 
include an array of HRTFs that, when implemented as impulse response filters, are
convolved with the monaural signal to produce a perceived effect of hearing a natural 
audio source, that is one having interacted with the head, torso and outer ear of the 
listener. FIG. 2 illustrates the case of audio speakers, particularly an array having a left 
audio speaker 122 and a right audio speaker 124 where, as part of the listener's
processing interface 112, the spatial filtering includes the convolution of filters
representing HRTFs as well as transaural processing to cancel the crosstalk. HRTF
databases, most commonly for a free field plane, are available and are mechanized as 
filters with tunable or otherwise adjustable coefficients. The listener can select nominal 
filters for the left and right ear as listener inputs 121. The HRTF adjustments 216 may be
for left and right translation where channel-to-channel delay may be employed, or may be
for increased distance where intensity decrease, high frequency attenuation and
reverberation may be introduced, or may be for enhancing the natural sound of the audio
speakers 122, 124 where coefficients of the filters representing the HRTF database 214
may be adjusted, or any combination thereof. The resulting filters, amplitudes and delays
are convolved with the reconstructed monaural source 117, with the two channels being
equalized and transaurally corrected 212 before the signals are sent to the audio speakers
122, 124.
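The convolution of a monaural signal with left-ear and right-ear impulse response filters can be sketched as follows. This is a minimal illustration, assuming NumPy and toy impulse responses that encode only the roughly 600 microsecond delay and 20 dB level difference mentioned earlier; measured HRIRs from a database would replace them, and equalization and transaural correction are omitted. The name `binauralize` and the 48 kHz sample rate are assumptions.

```python
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    """Convolve one monaural signal with a pair of head-related impulse
    responses and return a 2 x N array (row 0 = left, row 1 = right)."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

# Toy HRIR pair for a source to the listener's left (illustrative values only):
# the right ear hears the sound about 600 microseconds later and 20 dB quieter.
fs = 48000                             # assumed sample rate in Hz
itd_samples = round(600e-6 * fs)       # interaural delay expressed in samples
hrir_left = np.zeros(64)
hrir_left[0] = 1.0
hrir_right = np.zeros(64)
hrir_right[itd_samples] = 0.1          # 10 ** (-20 / 20)
```

Feeding a single click through `binauralize` with these toy filters yields a left channel that leads and dominates the right channel, which is the cue pair the HRTF description above relies on.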

FIG. 3 illustrates a monaural microphone and an example of its spherical coordinate
system 300. From a first reference axis 302, x, one subtends an azimuth angle 304, $\theta$, and
one next subtends an elevation angle 306, $\delta$. Along this directional vector, the audio
source 102 lies a distance, $\rho$, from the microphone origin 301, O. Other microphones
have left and right microphones integral to a single device, i.e., coincident, providing
directionality principally from the pressure differences. FIG. 4 illustrates a coincident
microphone 402 having two principal sensing elements in a horizontal plane 400. In the
horizontal plane, the audio source 102 subtends an azimuth angle 304, $\theta$, from the
reference axis 302, x, and lies a distance 407, $\rho_0$, from coincident microphone 402. By
differencing the pressure sensed by the two elements, for example, the azimuth angle 304,
$\theta$, can be measured.



FIG. 5 illustrates an example of a two-dimensional microphone array that has
microphones in an array 502 distributed linearly, each at an equal distance, d 504, from
one another. For an azimuthal angle of incidence 304, $\phi$, from an audio source 102 distant
enough from the microphone array 502 to produce a substantially linear wave front 506,
the wave front 506 time of arrival delay between each microphone is characterized as an
inverse z-transform:

$$z^{-1} = e^{-j(\omega d/c)\cos\phi}$$

The frequency response for an array of N such equally spaced microphones is expressed
as:

$$H(j\omega, \phi) = \sum_{n=0}^{N-1} a_n\, e^{-j(\omega/c)\, n d \cos\phi}$$

Because the response functions as a spatial filter, $a_n$ may be adjusted and/or shaped with
finite impulse response filtering to steer the array to an angle $\phi_0$ by inputting a time delay.
With the speed of sound, c, a nominal time delay, $t_0$, is set with

$$t_0 > n d / c$$

$$a_n = e^{-j\omega t_0}\, e^{+j(\omega/c)\, n d \cos\phi_0}$$

With the adjustment of $a_n$ within the effective steerable array spatial filter, the 2D array
of microphones is steerable to $\phi_0$. In addition, by conditioning the output of each
microphone with a finite impulse response filter, the N-1 nulls are available to be placed at
N-1 frequencies to notch out and otherwise mitigate discrete, undesired noise sources.
The steerable array may employ passive sweeps or infrared optics to augment source
locations.
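The steering relations above can be exercised numerically. The sketch below, assuming NumPy, evaluates the magnitude of the array response for a uniform linear array whose weights are set to steer toward a chosen angle. The function name, the normalization by the number of microphones, and the particular choice of nominal delay are assumptions made for illustration.

```python
import numpy as np

C = 340.0  # speed of sound in meters/second, as used in the text

def array_response(freq_hz, phi, phi0, n_mics, d, c=C):
    """Normalized magnitude of the response of a uniform linear array.

    phi  : actual angle of incidence in radians
    phi0 : steering angle in radians
    d    : microphone spacing in meters
    """
    omega = 2.0 * np.pi * freq_hz
    n = np.arange(n_mics)
    t0 = n_mics * d / c  # nominal delay, larger than n*d/c for every element
    # Steering weights: a bulk delay t0 plus a per-element phase advance.
    a = np.exp(-1j * omega * t0) * np.exp(1j * (omega / c) * n * d * np.cos(phi0))
    # Per-element phase delay of a plane wave arriving from angle phi.
    arrival = np.exp(-1j * (omega / c) * n * d * np.cos(phi))
    return abs(np.sum(a * arrival)) / n_mics
```

With eight microphones spaced 5 cm apart at 2 kHz, `array_response(2000.0, phi, phi, 8, 0.05)` is 1 for any `phi`, since the weights cancel the arrival phases exactly, while a source well away from the steering angle is attenuated.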



Stereophonic microphones are separated by distances that often preclude steerability
but provide time delay information nonetheless. For example, with two coincident
microphones separated by a known distance, $d_{1,2}$, as illustrated in FIG. 6, the angle of
incidence to each, $\phi_1$ 632 and $\phi_2$ 630, is measured, from which both $\rho_1$ 606 and $\rho_2$ 608
may be determined as well as the distance, $S_1$ 614, from the array. For example, applying
the Law of Sines:

$$\rho_1 = \frac{d_{1,2}\,\sin\phi_1}{\sin(\pi - \phi_1 - \phi_2)};$$

$$\rho_2 = \frac{d_{1,2}\,\sin\phi_2}{\sin(\pi - \phi_1 - \phi_2)}; \text{ and}$$

$$S_1 = \rho_1 \sin\phi_2 = \rho_2 \sin\phi_1.$$
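The Law of Sines relations can be checked with a short calculation. The following sketch uses only the Python standard library; the function name `triangulate` and the example values are assumptions for illustration.

```python
import math

def triangulate(d12, phi1, phi2):
    """Given the microphone separation d12 and the two measured angles of
    incidence (radians), return (rho1, rho2, s1) per the Law of Sines."""
    apex = math.pi - phi1 - phi2          # angle subtended at the audio source
    rho1 = d12 * math.sin(phi1) / math.sin(apex)
    rho2 = d12 * math.sin(phi2) / math.sin(apex)
    s1 = rho1 * math.sin(phi2)            # perpendicular distance from the array
    return rho1, rho2, s1

# Example: microphones 2 m apart, both reporting 45 degrees.
rho1, rho2, s1 = triangulate(2.0, math.pi / 4, math.pi / 4)
```

In this symmetric example the source sits 1 m in front of the midpoint of the two microphones, so `s1` evaluates to 1.0 and both ranges equal the square root of 2.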

Where omnidirectional or coincident microphones 402 may provide inadequate
resolution of their respective angles of incidence, a steerable array of microphones 602 can
be exchanged for each to enhance the resolution of the coincident microphones. Also illustrated in
FIG. 6 is the arrangement where an audio source 102 is directly aligned with one
microphone position. In such an arrangement, any other microphone positions along a
2D array line will sense the audio source signals with delay relative to the first
microphone position. This delay and the known microphone positions are used to resolve the
distance, $S_2$ 612, which should be substantially the same as $\rho_3$ 610, of the audio source
102 from the array 602 and can be used to refine the angles of incidence, $\phi_1$ 632 and $\phi_2$
630, for those microphones not directly in line with the audio source 102.
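One way the delay could resolve the distance is sketched below, under the assumption that "directly aligned" means the source lies on the perpendicular to the array line through the first microphone. The symbols $x$ (offset of a second microphone along the array line) and $\tau$ (its arrival delay relative to the first microphone) are introduced only for this illustration.

$$\sqrt{S_2^2 + x^2} = S_2 + c\,\tau$$

$$S_2^2 + x^2 = S_2^2 + 2\,S_2\,c\,\tau + c^2\tau^2$$

$$S_2 = \frac{x^2 - c^2\tau^2}{2\,c\,\tau}$$

For example, with $x = 1$ m and a measured delay corresponding to $c\,\tau = 0.2$ m, the sketch gives $S_2 = (1 - 0.04)/0.4 = 2.4$ m.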
SUMMARY

The present invention in its several embodiments includes a method of and system for 
processing sound data received at a microphone. The method includes the steps of: 
receiving a transmission having sound data and an audio source spatial data set relative to 
the microphone; using a sound conditioning filter database having filters characterized by 
a stored set of coefficients wherein each stored set of filter coefficients is a function of at 
least one element of the audio source spatial data set, to determine two or more stored 
sets of coefficients proximate to the at least one element of the audio source spatial data 
set; interpolating between the determined two or more stored sets of coefficients; 
convolving the sound data with a shaping filter having the interpolated filter coefficients; 
and then transmitting the resulting signal to a sound-producing device. A preferred 
embodiment accommodates a spatial data set having a first angle of incidence relative to 
the microphone, a second angle of incidence relative to the microphone substantially 
orthogonal to the first angle of incidence, or a distance setting relative to the microphone, 
or any combination thereof. A second embodiment of the method for processing sound
data received at a microphone includes steps of: transmitting sound waves toward a 
subject having a torso and a head via an audio speaker array; receiving the reflected 
sound waves via a microphone array; processing the received sound waves to determine 
time-relative changes in subject head orientation and subject torso orientation; translating 
the determined time-relative changes in subject orientation into changes in an audio 
source spatial data set using a sound conditioning filter database having filters 
characterized by a stored set of coefficients wherein each stored set of filter coefficients 
is a function of at least one element of the audio source spatial data set, to determine two 
or more stored sets of coefficients proximate to the at least one element of the audio 
source spatial data set; interpolating between the determined two or more stored sets of 
coefficients; convolving the sound data with a shaping filter having the interpolated filter 
coefficients; and transmitting the resulting signal to a sound-producing device. Example 
sound-producing devices that support effective three dimensional (3D) audio imaging
include headphones and audio speaker arrays.
The several system embodiments of the present invention for spatial audio source
tracking and representation include one or more microphones; a microphone processing 
interface for providing a sound data stream and an audio source spatial data set; a 
processor for modifying spatial filters based on the audio source spatial data set and for 
shaping the sound data stream with modified spatial filters; and a sound-producing array,
e.g., headphones or an array of audio speakers. As with the method embodiments, the
spatial data set includes an audio source distance setting relative to the one or more
microphones and a first audio source angle of incidence relative to the one or more
microphones, either separately or in combination, and may include a second audio source
angle of incidence relative to the one or more microphones, the second audio source
angle of incidence being substantially orthogonal to the first audio source angle of 
incidence. In some embodiments, the system also includes a first communication 
processing interface for encapsulating the sound data and an audio source spatial data set 
relative to the one or more microphones into packets and transmitting the packets via a network;
and a second communication processing interface for receiving the packets and
de-encapsulating sound data and the audio source spatial data set. In some embodiments,
the system also includes a first communication processing interface for encoding the
sound data and an audio source spatial data set relative to the one or more microphones
into telephone signals and transmitting via a circuit switched network; and a second
communication processing interface for receiving the telephone signal and decoding
the sound data and the audio source spatial data set. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example and not limitation in the figures of
the accompanying drawings, and in which:

FIG. 1 illustrates a speaker-listener session of the prior art;

FIG. 2 illustrates the incorporation of HRTFs of the prior art;

FIG. 3 illustrates a microphone-centered spherical reference frame of the prior art;

FIG. 4 illustrates a microphone-centered polar reference frame of the prior art;

FIG. 5 illustrates a steerable microphone array of the prior art;

FIG. 6 illustrates a coincident microphone array for determining relative angle of
incidence and relative distance of the prior art;

FIG. 7 illustrates a speaker-listener session embodiment of the present invention;

FIG. 8 illustrates a functional block diagram of an embodiment of the present invention;

FIG. 9 illustrates a speaker-listener session embodiment of the present invention;

FIG. 10 illustrates a functional system block diagram of an embodiment of the present
invention;

FIG. 11 illustrates a functional block diagram of an embodiment of the present invention;

FIG. 12 illustrates a tuning embodiment of the present invention; and

FIG. 13 illustrates a tuning embodiment of the present invention.

DETAILED DESCRIPTION 

FIG. 7 illustrates voice data transmission from a human speaker 102 to a human listener
126 via a first voice processing device 106 and a second voice-processing device 112
operably connected by a network such as the Internet 110. In this example, a coincident
microphone 402 captures the voice of the human speaker 102. A steerable array of
microphones 502 or a distributed array 602 of coincident microphones 402 or
omnidirectional microphones are alternatives that may be preferred for teleconferencing.
The microphone interface 108 may include filters necessary to shape the audio signals
prior to digitization to minimize aliasing effects, for example. The microphone
interface 108 may include sampling and quantizing the signal to produce a digital stream.
The microphone interface 108 may also include digital signal processing for deriving an
angle of incidence of the audio source 102 in a measurable plane and may include nulling
or notching filters to eliminate noise sources directionally.

Conceptually, the voice data is transmitted via a data plane. In implementation, the
captured voice, for example, is, in the preferred embodiment, converted into a format
acceptable for transmission over the Internet, such as VoIP, thereby encapsulating the voice
data with destination information, for example. The second voice-processing device 112
de-encapsulates the voice data from the VoIP protocol 114 into a monaural digital signal
117. The monaural signal 117 is convolved with spatial audio filtering 116 and converted via
speaker drivers 118 to drive two channels, in this example each having an audio speaker
122, 124. The listener may have indicated 121 selections, via an interface 120, for the
spatial audio filtering to draw from a bank of HRTFs that are either close to the listener in
acoustical effect or tuned for the listener. In the preferred operation, the resulting effect
is an audio source for the listener that is more natural and, in this example, the audio
"image" may be centered between the two audio speakers, moved left or right of center
by the listener and given frequency response shaping, reverberation and amplitude
reductions that may produce an effect of a more distant source. While the HRTF has in
the past been described and analyzed according to classical time and frequency domain
analysis, it is important to note that the same relationships can be alternatively modeled
in the wavelet domain, i.e., instead of describing the model as a function of time, space,
or frequency, the same model can be described as a function of basis functions of one
or more of the same variables. This technique, as well as other modern mathematical
techniques, such as fractal analysis, a modeling technique based on self-similarity of
multivariable functions, may be applied in some embodiments with the intent of achieving
greater processing and storage efficiencies with greater accuracy than the classical
methodologies.

In an embodiment of the present invention illustrated in FIG. 7, the microphone interface
708, in addition to other signal processing functions, derives an angle of incidence, $\theta$, for
the voice of the human speaker 102, preferably relative to the microphone 402 or center of
the microphone array 502, 602, for example. Conceptually, this angle of incidence may
be communicated on the signal plane. In a preferred implementation, this derived angle of
incidence, $\theta$, as source-to-microphone relative spatial data 711, is encapsulated along with
the voice data 709 with an extended VoIP 710 accommodating this data, and the data is
transmitted as packets 140, 150 via a network 110 to a second VoIP processing device
112 enabled to de-encapsulate, at the communication processing interface 714, the
extended VoIP data packets having angle of incidence, $\theta$, data into a reconstructed monaural
signal 117 and the reconstructed source-to-microphone relative spatial data 717. The
spatial filtering of the second VoIP processing device 112 includes the angle of incidence
information by interpolating 716 the selected HRTFs to account for an angle of incidence
if not already overridden by the listener via listener inputs 121 at the listener interface
120. In this example, the human speaker 102 is left of center of a microphone assembly
402 or array 502, 602. With the listener 126 having set the source preference to be that
the human speaker acoustical image is nominally facing the listener when the listener is
facing the audio speaker array 122, 124, the resulting "imaged" audio source 728 is
perceived to be right of center of the audio speaker array 122, 124. In addition, the
listener may choose to add depth cues to push off the perceived distance of the translated
human speaker 730 to be aft of the audio speaker array. Alternatively, the listener 126
may select to ignore the angle of incidence information in the processing of his spatial
filtering of the monaural signals, leaving the "imaged" source to be in the center 128 of
the speaker array 122, 124. The user may add distance effects 130 if he so desires.
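The idea of carrying source-to-microphone spatial data alongside the voice data in each packet can be illustrated with a toy byte layout. The sketch below is purely hypothetical: the header fields, their order and sizes, and the function names are assumptions made for illustration and do not correspond to any actual VoIP, SIP, or RTP extension.

```python
import struct

# Hypothetical header: azimuth (deg), elevation (deg), relative distance as
# 32-bit floats, then the voice payload length as an unsigned 16-bit integer,
# all in network byte order.
_HEADER = struct.Struct("!fffH")

def encapsulate(voice, azimuth_deg, elevation_deg=0.0, distance=1.0):
    """Prefix a block of encoded voice bytes with the spatial data set."""
    return _HEADER.pack(azimuth_deg, elevation_deg, distance, len(voice)) + voice

def de_encapsulate(packet):
    """Split a packet back into voice bytes and the spatial data set."""
    az, el, dist, n = _HEADER.unpack_from(packet)
    voice = packet[_HEADER.size:_HEADER.size + n]
    spatial = {"azimuth_deg": az, "elevation_deg": el, "distance": dist}
    return voice, spatial
```

A receiver that ignores the spatial data, as when the listener overrides it, can simply discard the dictionary and use the voice bytes alone.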

As illustrated in FIG. 8, the first transmitted angle of incidence, the second transmitted
angle of incidence substantially orthogonal to the first transmitted angle of incidence, or a
relative distance setting, or any combination 717, is used to drive the interpolation 804 of
the HRTF database to a solution of filter coefficients between previously quantified
incident angles, i.e., those having filter coefficient arrays based on acoustical
measurements, so that the convolution includes the spatial filters adjusted for one or both
of the transmitted incidence angles. In embodiments having planar implementations, the
HRTFs may be a function of frequency and azimuth angle. In a horizontal plane HRTF
interpolation example, the interpolation can be a linear interpolation of the HRTF
coefficients for the stored azimuth angles of incidence that bound the derived azimuth
angle of incidence. While the above example is illustrated in a horizontal plane, the
invention is readily extended to a three-dimensional array where the microphone array
and audio speaker array are in a plane rather than linear. In the three-dimensional
implementation, the HRTFs may be a function of frequency and of azimuth and elevation
angles of incidence, where the range is removed in free field implementations. In a
horizontal and vertical HRTF interpolation example, the interpolation can be a linear
interpolation of the HRTF coefficients for the stored azimuth and elevation angle of
incidence pairs that bound the derived azimuth angle of incidence and the derived
elevation angle. Conceptually, this is interpolating to a point within a parallelogram
region defined by the stored coefficients as functions of pairs of azimuth and elevation
angles of incidence. Higher order and nonlinear interpolations may be applied where
appropriate to properly scale the perceived effect. Where interpolation is inadequate to
supply the shaping sought for the acoustical "image" for all expected angles of incidence,
then increasing the resolution of the HRTF database may be required.
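The linear interpolation between stored coefficient sets can be sketched as follows, assuming NumPy and a database held as an array of filter coefficients indexed by sorted azimuth angles. The function names and the data layout are assumptions; a real database may need finer resolution or a higher order scheme, as noted above.

```python
import numpy as np

def interpolate_azimuth(azimuths, coeffs, az):
    """Linearly interpolate filter coefficients at azimuth `az` (degrees).

    azimuths : sorted 1-D sequence of stored azimuth angles
    coeffs   : array of shape (len(azimuths), taps), one filter per angle
    """
    azimuths = np.asarray(azimuths, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    az = float(np.clip(az, azimuths[0], azimuths[-1]))
    hi = int(np.searchsorted(azimuths, az))   # first stored angle >= az
    if azimuths[hi] == az:
        return coeffs[hi].copy()              # exact hit on a stored angle
    lo = hi - 1
    w = (az - azimuths[lo]) / (azimuths[hi] - azimuths[lo])
    return (1.0 - w) * coeffs[lo] + w * coeffs[hi]

def interpolate_cell(c00, c10, c01, c11, u, v):
    """Blend four coefficient sets at the corners of an azimuth/elevation
    cell; u and v in [0, 1] are the fractional positions along each angle."""
    return ((1 - u) * (1 - v) * c00 + u * (1 - v) * c10
            + (1 - u) * v * c01 + u * v * c11)
```

For a derived azimuth of 15 degrees bounded by stored filters at 0 and 30 degrees, `interpolate_azimuth` returns the average of the two stored coefficient sets, and the resulting filter is then convolved with the monaural signal.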

In FIG. 9, the speaking human 102 moves from a first location to a second location
during a session where the distance relative to the microphone 402 or microphone array
502, 602 is characterized as a vector 902 having time differences in measured angles of
incidence and differences in perceived distance settings. The microphone interface
processing 706 of the microphones 402 or microphone array 502, 602 in this example for
the first location may yield an initial angle of incidence of sufficient quality to be
included along with the voice data in data packets and transmitted over a network. The
listener interface processor 112 processes 716 the angle of incidence and places the
perceived audio source to the right of center of the two audio speakers 728. This is an
automatic nominal setting. The listener can override this effect and may adjust the filters
to induce a distancing effect 730 for a listener-selected nominal position of the acoustical
"image." The new position of the human speaker is derived from the microphone
processing 708 and sent via the VoIP communication processing interface 710, whereby the
new angle of incidence is transmitted to effect, in the signal processing 716, the
interpolation 804 of the coefficients of the HRTFs. In this
example, the microphone processing also derives a relative change in the distance of the
human speaker 102 relative to a reference point of the microphones 402 or microphone
array 502, 602. As with the derived angle of incidence, the derived relative distance may
be included as relative spatial data 711 along with the voice data 709 in data packets,
preferably the VoIP 710, and transmitted over a network 110. The listener interface
processing 112 may then account for the change in angle of incidence 910 from a
nominal derived position 728, or may then account for the change in derived relative
distance 730, or account for both 912. If the listener set a perceived distance 914 or angle
or both for the human speaker, then the listener interface processing may account for the
change in angle of incidence 920, change in distance 916, or both 918.

FIG. 10 illustrates an example of an embodiment of the system in one direction of
transmission with the understanding that bi-directional transmission is intended as
well, with each participant in the voice exchange having the necessary devices and
functionality. The microphones or microphone array 1010 is connected with the
computer 106 of the human speaker 102. The microphone signal processing 708 may
include analog filters to mitigate aliasing, for example, and digital filters for setting nulls
or notches and for reducing cross-talk, for example. If available, the microphone signal
processing 708 determines one or both of the angles of incidence and the nominal
distance setting of the human speaker 102 relative to the microphone array 1010, i.e., the
voice origin data 711. The determined relative angle of incidence and relative distance
settings are prepared 1012 to be added to packets according to the VoIP and then the
voice data 109 are encapsulated along with the voice origin data 711 according to the
enhanced VoIP communication processing interface 1014. With a session established
1018, 1019, the voice and voice origin data are sent to the listener via the Internet 110.
The computer of the listener 112 receives the data packets 150 and de-encapsulates the
voice data packets according to the enhanced VoIP communication processing interface
1016. The voice data provides the monaural signal 117 and the voice origin data 717
may be used, depending upon the settings 1040 input by the listener 126 via the HR filter
interface 120, in the HRTF interpolation 804 of spatial filter coefficients 214 for the
conditioning 1020 of the monaural signals 117. Also illustrated is a pathway via the
listener microphone or microphone array 1030 whereby the listener 126 may, in some
embodiments, effect by his voice characteristics 1031 changes in the interpolation by the
microphone or microphone array processing 1008 determining changes in the listener's state
1042, particularly changes in the listener's relative angle of incidence to, and changes in
the listener's relative distance from, the microphone or microphone array. This same
pathway may be exploited passively in some embodiments to process acoustical waves
originally emanating from the acoustical speaker array 1032 and diffusing 1034 from the
listener's body and body parts, particularly including the head and torso.

FIG. 11 illustrates in an expanded view the functional block diagram of the passive 
pathway process where acoustical waves are reflected 1034 by the listener's head or 
torso, or both 1102, and registered by the listener's microphone or microphone array 
1030. The frequency content of the acoustical waves is preferably selected to provide 
the most probative effect of the changes in the listener's orientation where interpolation 
may readily effect improvements and corrections to the perceived source. Filters 
downstream from the microphone or microphone array may be employed to eliminate or 
otherwise ameliorate unwanted sound sources proximate to the listener. The corrective 
potential of this passive path is enhanced with additional audio speakers, with additional 
microphones and with an anechoic environment. 
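
As one possible illustration of the passive path, the delay between a known emitted 
signal and its registration at a microphone may be estimated by cross-correlation and 
converted to a path length. The specification does not state how the reflected waves 
are analyzed; the cross-correlation technique, the function names and the assumed speed 
of sound below are assumptions of this sketch only. 

```python
def estimate_delay_samples(probe, recorded):
    """Return the lag, in samples, at which `recorded` best matches `probe`.

    A brute-force cross-correlation over all lags where the probe fits
    entirely inside the recording (hypothetical helper, not from the source).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recorded) - len(probe) + 1):
        score = sum(p * recorded[lag + i] for i, p in enumerate(probe))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def reflection_path_length_m(delay_samples, sample_rate_hz, speed_of_sound=343.0):
    """Convert a delay in samples to the total acoustic path length in metres.

    The default speed of sound (343 m/s, air at about 20 degrees C) is an assumption.
    """
    return delay_samples / sample_rate_hz * speed_of_sound
```

A change in the estimated path length between successive measurements would then be one 
indication that the listener has moved relative to the speakers and microphones. 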

FIG. 12 illustrates an example array of microphones and an example array of acoustical 
speakers where the listener 126 originally sets 120, 121 the HR filters to a desirable 
acoustical "image" of the human speaker source. The listener moves away from the front 
microphone and turns to place the head and torso at an angle relative to the front line of 
audio speakers 1202. To the extent these changes in listener orientation are discernible 
by the microphones and microphone signal processing, there is then an automatic 
adjustment, via the interpolation of the HRTF bank, with the resulting acoustical image being 
corrected for the listener's change in orientation. The acoustical measurements may also 
be augmented with passive optical sensing and by manual adjustments of the listener. 
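
The automatic adjustment by interpolation of an HRTF bank may be sketched, under 
simplifying assumptions, as a linear blend of the two stored coefficient sets whose 
azimuths bracket the listener's newly determined angle. The bank layout, the function 
name and the choice of linear interpolation are assumptions of this sketch; the 
specification does not fix a particular interpolation method. 

```python
from bisect import bisect_right


def interpolate_hrtf(bank, azimuth_deg):
    """Linearly blend the two bank entries that bracket azimuth_deg.

    bank: list of (azimuth_deg, coefficients) pairs sorted by azimuth, where
    each coefficients entry is an equal-length list of floats (hypothetical
    layout). Azimuths outside the bank are clamped to the nearest entry.
    """
    angles = [angle for angle, _ in bank]
    if azimuth_deg <= angles[0]:
        return list(bank[0][1])
    if azimuth_deg >= angles[-1]:
        return list(bank[-1][1])
    hi = bisect_right(angles, azimuth_deg)  # first entry strictly above the angle
    lo = hi - 1
    (a0, c0), (a1, c1) = bank[lo], bank[hi]
    w = (azimuth_deg - a0) / (a1 - a0)  # 0 at the lower entry, 1 at the upper
    return [(1.0 - w) * x0 + w * x1 for x0, x1 in zip(c0, c1)]
```

For example, with entries stored at 0 and 30 degrees, a listener angle of 15 degrees 
yields coefficients midway between the two stored sets. Linear blending of filter 
coefficients is only a coarse approximation and is used here for brevity. 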

FIG. 13 illustrates, together with FIG. 12, a translation-only example of exploiting the 
listener microphone or microphone array 1030 pathway where the acoustical speaker 
array includes, for example, left and right audio speakers 122, 124, and additional left and 
right audio speakers 1222, 1224 that are responsive to changes in the 
listener's relative translational position and rotational position 1202. If done actively, the 
microphone processing 1008 is principally dependent upon the voice of the listener 126. 
If done passively, the process is similar to the passive process as described and illustrated 
in FIG. 12. 

Where headphones are used by the listener, a true binaural effect is achieved with little, 
if any, of the transaural processing needed in the audio speaker embodiments. 
Preferably, however, head-tracking is employed to accommodate the listener's rotation in the 
interpolation process to "stabilize" the perceived location of the audio source. 
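
A minimal sketch of such stabilization, assuming a head tracker that reports the 
listener's yaw in degrees, subtracts the head rotation from the intended source azimuth 
before the interpolation is performed. The function name and the angle convention are 
assumptions of this sketch. 

```python
def source_azimuth_relative_to_head(source_azimuth_deg, head_yaw_deg):
    """Return the source azimuth in head coordinates, wrapped to [-180, 180).

    Both angles are measured in the same room-fixed frame and in the same
    rotational sense (an assumed convention, not from the specification).
    """
    relative = source_azimuth_deg - head_yaw_deg
    return (relative + 180.0) % 360.0 - 180.0
```

When the listener turns toward the source, the relative azimuth falls to zero, so the 
perceived source remains fixed in the room rather than rotating with the head. 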

While the above examples have been with data packets typical of Internet-based 
communications, the invention in other embodiments is readily implementable via 
encoding on switched circuits, for example in an Integrated Services Digital Network 
(ISDN) preferably with users having computer telephony interfaces. 

The words used in this specification to describe the invention and its various 
embodiments are to be understood not only in the sense of their commonly defined 
meanings, but to include by special definition in this specification structure, material or 
acts beyond the scope of the commonly defined meanings. Thus if an element can be 
understood in the context of this specification as including more than one meaning, then 
its use in a claim must be understood as being generic to all possible meanings supported 
by the specification and by the word itself. 

Many alterations and modifications may be made by those having ordinary skill in the art 
without departing from the spirit and scope of the invention and its several embodiments 
disclosed herein. Therefore, it must be understood that the illustrated embodiments have 
been set forth only for the purposes of example and that they should not be taken as limiting 
the invention as defined by the following claims. 